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Abstract: 


Compared with traditional statistics, only a few social scientists employ Bayesian analyses. The 
existing software programs for implementing Bayesian analyses such as OpenBUGS, WinBUGS, 
JAGS, and rstanarm can be daunting given that their complex computer codes involve a steep 
learning curve. In contrast, this paper introduces a new open software for implementing 
Bayesian network modelling and analysis: the bayesvl R package. The package aims at providing 
an intuitive gateway for beginners of Bayesian statistics to construct and analyse mathematical 
models in social sciences. To achieve this aim, the bayesvl package integrates three core 
functions seamlessly: (i) designing Bayesian network models using directed acyclic graphs 
(DAGs) of bnlearn, (ii) generating attractive visualization of ggplot2, and (iii) simulating data 
and computing posterior distribution using the Markov chain Monte Carlo (MCMC) algorithms 
of rstan and rethinking. A case example illustrates how the bayesvl package helps leverage users’ 
intuition in creating and evaluating mathematical models of their social scientific problems 
while minimizing the daunting aspect of writing complex computer codes. 
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Classification number: 7 


Introduction of social sciences and humanities. Firstly, the 
Bayesian inference process incorporates both 
mathematics and background knowledge in 
the model [2, 3]. Second, actual observations 


are the primary concern when checking the 


While there are many straightforward 
the 
widespread use of Bayesian statistics, it appears 


philosophical reasons _ for greater 


that, in practice, Bayesian statistical analysis strength of evidence, unlike the assumption 


is still limited to only a minority of social and o¢ ay 


infinite number of observations in 


behavioural scientists. The Bayesian approach 
is methodologically similar to the way one’s 
intuition works [1]. There are several factors that 
provide Bayesian statistics with strengths over 
traditional statistics, especially in the realm 
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conventional statistics. Moreover, even when 
background knowledge is incorporated, the 
final results can still be computed appropriately 
with one model able to be quantitatively 
compared with other models [3, 4]. Here, 
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the Bayesian approach enables researchers 
to utilise mathematical tools with intuition 
and common sense [5]. In essence, when 
using Bayesian statistics, a researcher will use 
prior distributions to reflect their initial beliefs 
(informed background knowledge, sometimes 
even intuition), then update their initial beliefs 
with new data to form posterior distributions 
[6-8]. However, social scientists are still more 
familiar with the conventional statistics [9]. 
For instance, studies in top sociology journals 
using Bayesian statistics are virtually non- 


existent [10]. 


One of the factors that contribute to this 
issue is the availability of friendly software 
for conventional statistics such as SPSS, Stata, 
or SAS. As for the lack of intuitive software, 
the current software packages for Bayesian 
analysis such as require OpenBUGS [11], 
JAGS [12], MCMCglmm [13], Stan [14], 
brms [15], rethinking [16], and rstanarm 
[17] require users to be highly familiar with 
a command-line interface. The users must 
code the model from scratch, which can entail 
a steep learning curve for many statistical 
novices. Thus, we created an R package called 
bayesvl to help social scientists focus on their 
research problems without being caught up 
in computational details. With relative ease- 
of-use and minimal coding, the bayesvl R 
package can help create Bayesian network 
models and automatically generate stan codes 
of the graphical structure of Bayesian networks 
to perform sampling and parameter learning. 
It can also perform the Hamiltonian MCMC 
simulations and provide multiple visualization 
tools such as a Bayesian network graph, a bar 
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chart for conditional probabilities, a trace plot, 
a Gelman shrink factor plot, an autocorrelation 
effect plot, a highest posterior distribution 
intervals plot, a density plot, a pairwise 
parameter comparison plot, etc. Moreover, 
it can produce a number of statistical and 
computational checks such as calculating 
conditional probabilities, the effective sample 
size, or r-hat statistics. Finally, it supports a 
detailed model comparison process with the 
Pareto-smoothed importance sampling leave- 
one-out cross-validation approach (PSIS-LOO- 
CV) [18], the widely applicable information 
(WAIC) 
stacking weights [4, 20]. In this paper, we will 


criterion [19], and the Bayesian 
briefly introduce the core functions of bayesvl 
and demonstrate its functions with a real-life 
example. 


The hayesvil R package 


The bayesvl project was launched in 2017 
[21, 22] and the package was eventually 
published in CRAN [23] and Github [24] in 
2019. There were several components that 
contributed to our conception of the software: 
the visualization capability of R and the 
causality and uncertainty inherent to Bayesian 
Network modelling [25]. Simulated data using 
the MCMC algorithm makes this package more 
scientific for social science research in the 
age of Big Data and is visually appealing and 
intuitive to the readers [26]. To this end, the 
bayesvl R package was developed referring to 
rethinking [16] and rstanarm [17] for MCMC 
272 28) 
Bayesian network modelling; and ggplot2 for 


simulation method; bnlearn for 


beautiful and flexible data visualization. 


The model fitting procedure of the bayesvl 


package is relatively simple. First, it is 
recommended that users install version 3.5.1 or 
more recent versions of R software. Then, the 
ggplot2, rstan, and the bayesvl packages need to 
be installed. Then, a diagram needs to be built 


to visualize the relationship between variables. 


Moreover, visualization supports four 


cognitive mechanisms: reinterpretation, 
abstraction, combination, and mapping [29, 
30]. The bayesvl package provides simple 
code to help researchers build the diagram. 
Other packages for Bayesian statistics such as 
brms, MCMCglmm, rstanarm, and rethinking 
tend to require the user to code mathematical 
formulae, which can be intimidating. Thus, the 
simplicity of bayesvl code helps researchers 
ease into to the world of Bayesian statistics. 
Next, the Stan code for model fitting can 
be created automatically with the bayesvl 


package’s link to rstan. Finally, the bayesvl 
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package allows the MCMC results to be viewed 
in both numbers and charts. 


In the following section, we will examine 
the model fitting procedures and graphical 
ability of the bayesvl package through a real- 
life example: a nested multi-level Bayesian 
network with varying intercept, which analyses 
how the satisfaction of medical patients are 
affected by their socioeconomic, residence, 
and insurance status, and the outcome of their 
medical treatment. The dataset for this real-life 
example was deposited in the Open Science 
Framework (OSF)’s depository and contained 
1042 observations on health insurance, health 
care, and socioeconomic status [31, 32]. 


Model construction 


The statistical problem at hand is to 


investigate the level of a patient’s financial 


(a) 


(ot) 


Fig. 1. A nested multi-level Bayesian network with varying intercepts analyses how a patient’s financial 
burden is associated with their residence, length of hospital stay, employment condition, and the amount 


of ‘thank you money’. 
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burden given their socio-residence _ status, 
length of hospital stay, employment condition, 
and the amount of ‘thank you money’ included 
in the treatment fee. First, one can create a 
visual representation of the relationships among 
the variables, i.e., a relationship tree. Fig. 1 
presents a visualization of this example. 


Explanatory variables: 
e Res: whether a patient lives in the same 
region as the hospital; yes=1; no=2. 


e Stay: the amount of time a patient stays at 
the hospital; under 10 days (short)=0; or more 
than 10 days (long)=1. 


the condition 


of the patient; 


e Jcond: employment 


stable=3; unstable=2; or 


unemployed=1. 


e Envi: the amount of ‘thank you money’ 
that the patient must pay with the treatment 
fee; high (>15%)=3; medium (7-15%)=2; low 
(<7 %)= 11 

Response variable: 

e Burden: the self-reported financial situation 
of the patient after the treatment; minimally 
affected (A)=1; (B)=2; 
destitute (C)=3; adversely destitute (D)=4. 


adversely affected 


As one can see, the model is a nested 
multi-level Bayesian network, which enables 
simultaneous analyses of individual quantities 
to be performed [2]. This is also a direct 
example of how Bayesian modelling leverages 
the background knowledge of social science 
It is true that data sets that 


pool data over multiple units are commonplace 


researchers [9]. 
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in social sciences; for example, people can 
attend different schools, companies trade in 
different markets, and votes live in different 
structures are 


communities. These causal 


captured in Bayesian hierarchical models. 


In a simple regression model, we have the 


following mathematical formula: 


y =a Bx re (1) 
in which €,~ N(O, o). 
With the varying intercept multi-level 


model, the mathematical formula has the 


following form: 
(2) 


y =O, +Bx;+e, 


Applying Eq. (2) to the model in Fig. 1, we 


have: 
 ondealiy Opec Xpestiy) 7 B stay Hall 
+ Pista tethd fi] T Dine eat Oi] + S hardeatil (3) 


In the model, B and B,.., are three 


stay’ Pieced 


regression coefficients indicating the slopes of 
regression lines or the rate of change in y, 
as X and x 


x change, respectively, 


stay’ Jcond/ Envl 


while a,.. stands for a varying intercept. 

Once the user has a clear picture of the causal 
and correlational structure of the variables, 
the next step is to code the Bayesian network 
model. The following box presents how the 


nodes and arcs are added to the model. 


# Add nodes to model 
model <- bayesvl() 
model <- bvl_addNode(model, “Burden”, 
model <- bvl_addNode(model, “Stay”, “binom”) 
model <- bvl_addNode(model, “Jcond”, “norm”) 
( 
( 


(aL 


norm”) 


model <- bvl_addNode(model, “EnvL”, “norm”) 


ad 


model <- bvl_addNode(model, “Res”, “cat”) 
# Add arcs to model 
model <- bvl_addArc(model, “Stay”, “Burden”, “slope”) 


( 

model <- bvl_addArc( slope”) 
( 
( 


(idee 


model, “Jcond”, “Burden”, 
model <- bvl_addArc(model, “EnvL”, “Burden”, “slope”) 


model <- bvl_addArc(model, “Res”, “Burden”, “varint”) 


A relational diagram is constructed with two 
commands: bvl_addNode and bvl_addArc. 
When using bvl_addNode, 
distribution of any variable is chosen with 


the statistical 
“norm” for normal distribution, “binom” for 
binominal distribution, or “cat” categorical 
distribution. the bvl_ 


addArc is used for setting the mathematical 


Meanwhile, code 
relationship between two nodes, for example, 
with a fixed-effect model (“slope”), varying 
varying 
model (“varslope”), or a mixed-effect model 


slope model (“varint”), intercept 
(“varpars”). For multilevel modelling, random- 
intercept (“varslope”) and mixed-effect models 


(“varpars”) are used. 
Automatic generation of stan codes 


The bayesvl package allows the users to 
easily generate the stan codes for the model 
using the following commands: 


# Generate the stan code for model 
model_string <- bvl_model2Stan(model) 
cat(model_string) 


As a result, we have the following stan codes 
automatically generated from building the 
graphical structure of the Bayesian network. 
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functions{ 
int numLevels(int[] m) { 
int sorted[num_elements(m)]; 
int count = 1; 
sorted = sort_asc(m); 
for (i in 2:num_elements(sorted)) { 
if (sorted[i] != sorted |[i-1]) 
count = count + 1; 
} 


return(count); 


} 
data{ 

// Define variables in data 

int<lower=1> Nobs; // Number of observations (an 
integer) 

real Burden[Nobs]; // outcome variable 

int<lower=0,upper=1> Stay[Nobs]; 

real Jcond[Nobs]; 

real EnvL[Nobs]; 

int NRes; 

int<lower=1 ,upper=NRes> Res[Nobs]; 
} 
transformed data{ 
// Define transformed data 


parameters{ 
// Define parameters to estimate 
real<lower=0> sigma_Burden; 
real b_Stay_Burden; 
real b_Jcond_Burden; 
real b_EnvL_ Burden; 
real aO_Res; 
real<lower=0> sigma_Res; 
vector[NRes] u_Res; 


transformed parameters{ 
// Transform parameters 
real mu_Burden[Nobs]; 
vector[NRes] a_Res; 
// Varying intercepts definition 
for(k in 1:NRes) { 

a_Res[k] = aO_Res + u_Res[k]; 
} 


for (i in 1:Nobs) { 


MCMC _ simulation and 


diagnostics 


convergence 


Next, using the following commands, we 
can fit the model and execute the MCMC 
estimation for the model. 


# Fit the model 

model <- bvl_modelFit(model, dat, warmup = 2000, 
iter = 5000, chains = 4, cores = 4) 

summary(model) 


The summary of the model is given in Table 1. 
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Table 1. Simulation result. 


Mean SD n_eff Rhat 
b_Stay_Burden 0.22 0.05 8933 1 
b_Jcond_Burden S000) 0.04 873 1 
b_EnvL_Burden -0.06 0.03 8613 1 
a_Res[Yes] 2.76 O12 7895 1 
a_Res[No] 3.58 Oo 7846 | 
aO_Res DENTE eialis) lis6s ie) 
sigma_Res 3.66 3:82 213 | 


Two values were used to diagnose the 
convergence of the model: n_eff (effective 
sample size) and Rhat value. In standard 
practice, n_eff is good when it reaches above 
1000 samples. Rhat is roughly 1, which 
means all chains have the same distribution. 
Meanwhile, the model should be checked 
again when Rhat is greater than 1.1. In Table 
1, n_eff is greater than 1400 and Rhat equals 1, 


b_Stay_Burden 


0.0 
2000 3000 4000 2000 3000 


a_Res[1] 


3000 4000 5000 2000 3000 


sigma_Res 


2000 3000 4000 5000 


Fig. 2. 
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b_Jcond_Burden 


a_Res[2] 


which suggests that the model is successfully 
converged. 


A fundamental aspect of the MCMC 
approach is to visually diagnose the 
convergence key indicators such as Markov 
chains, the Shrink factor, andthe autocorrelation 
phenomenon. The following examples help us 
understand this visual diagnostics process. 


Figure 2 presents the trace plots for all 


b_EnvL_Burden 


4000 5000 2000 3000 4000 5000 


a0_Res 


4000 5000 2000 3000 4000 5000 


The trace plot for each variable of the Bayesian regression model. 
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Fig. 3. The Gelman plot for each variable of the Bayesian regression model. 


model variables, which show no divergent eventually stops. 
chains. The figure can be generated with the 


Figure 5 provides technical validation for 
command bvl_plotTrace. 


the posterior distribution of each variable in 
Figure 3 shows the convergence of all the model. The illustration can be generated 
variables in the model according to the Gelman using bvl_plotParams. 
shrink factor (or potential scale reduction 
factor). To generate the visualization, the 
command bvl_plotGelmans is used. Similarly, (PSIS) diagnostic plot in Fig. 6 illustrates the 
Figs. 4 and 5 also illustrate different values model’s goodness-of-fit given the underlying 
for technical validation. data. The PSIS diagnostic plot is used together 


The Pareto-smoothed importance sampling 


Figure 4 presents the autocorrelation factor with the leave-one-out cross-validation (LOO) 
(ACF) for each variable in the model (using technique to validate the model’s performance. 
the command bvl_plotAcfs). Technically, _Inthebayesvl package, the LOO-PSIS diagnostic 


the MCMC algorithm produces samples can be used with the following command: 
that correlate with each other but are not 
independent. Therefore, the ACF plot allows 
researchers to check whether autocorrelation | P!ot(!oo) 


loo<-bvl_stanLoo(model) 
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Fig. 4. The autocorrelation factor plot for each 
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Fig. 5. The parameter plot for each variable of the Bayesian regression model. 
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PSIS diagnostic plot 


100 


Fig. 6. The LOO plot for each variable of the Bayesian 
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regression model. 
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Fig. 7. The interval plot of the coefficients’ posterior distribution. 


Regression results 


The range of posterior distribution (Fig. 7) 
can be visualized using the command bvl_ 
plotintervals. 


04 


Or we can view the coefficients in a different 


style using the command _ bvl_plotDensity 
(Fig. 8). 
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Density 


Value 


Fig. 8. Plotting the coefficients’ posterior distribution. 


As demonstrated in the example above, 
the bayesvl R package can be useful for social 
sciences and humanities researchers to gain 
familiarity with Bayesian statistics. First, the 
bayesvl R package allows users to visualize 
their model in the form of a Bayesian network, 
and makes it is relatively easy to code the 
graphical structures of any Bayesian network 
with only two commands: bvl_addNode and 
bvl_addArc. Bayesian network modelling 
is suitable for handling problems due to 
the veracity of data [25] as well as ensuring 
the examination of the models. As a visual 
presentation, researchers can visually inspect 
the formulated correlational structures [33, 
34]. With the link to ggplot2, bayesvl allows 
quick and easy visualization of the data as 
well as of the posterior distribution. Good data 
visualization can help researchers quickly 
identify errors in the data [35] and point them 
toward possible causal/correlational structures 
in the data. Moreover, bayesvl can help 
researchers with the Stan code, especially new 
learners who are not familiar with the coding 
language of Stan. 
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B. Aczel, et al. (2020) [3] observed that 
selecting from a wide variety of inference 
tools in Bayesian analysis can be confusing 
and D. Spiegelhalter (2019) [2] also noted 
the daunting computational nature of modern 
Bayesian analysis, its lack of established and 
widely accepted criteria for significance, and 
its lack of user-friendly software. However, 
there have been many attempts to analyse 
the practice and philosophy of Bayesian data 
analysis to point toward a more established 
framework of Bayesian inference. For 
example, J. Gabry, et al. (2019) [36] argued 
that Bayesian data analysis, in practice, is an 
iterative process of model building, inference, 
model checking and evaluation, and finally 
model expansion. In each stage, visualization 
is indispensable. However, A. Gelman and 
C.R. Shalizi (2013) [37] argued that the most 
successful forms of Bayesian statistics do not 
support that particular philosophy but align 
more with a form of ‘hypothetico-deductivism’. 
Similarly, S.E. Lazic, et al. (2020) [38] argued 
how a Bayesian approach can deal with the 
problem of pseudo-replication. In building 
this software package, we hope to contribute 
to the growing spread of Bayesian statistics 


applications in the social sciences and 
humanities. We believe that an appreciation 
for testing and trying new methodologies will 
make social sciences and humanities studies 
more scientific and reproducible [39, 40]. 
Compared with the conventional frequentist 
regression, Bayesian regression modelling is 
better at dataset analysis of small sample sizes 
[41]. Moreover, the Bayesian approach offers a 
wide range of tools to construct, compare, and 
choose models based on their predictability 
performance [8]. In this new era of Al and 
Big Data, reproducibility and transparency 
are two values one must uphold, which will 
greatly reduce the cost of science [42]. 
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